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' Canonical correlation analysis is a statistical technique that is used to find relations between two 

sets of variables. An important extension in pattern analysis is to consider more than two sets of 



variables. This problem can be expressed as a quadratically constrained quadratic program (QCQP), 
l/^ ' commonly referred to Multi-set Canonical Correlation Analysis (MCCA). This is a non-convex problem 

and so greedy algorithms converge to local optima without any guarantees on global optimality. In this 
paper, we show that despite being highly structured, finding the optimal solution is NP-Hard. This 
motivates our relaxation of the QCQP to a semidefinite program (SDP). The SDP is convex, can be 
I— I , solved reasonably efficiently and comes with both absolute and output-sensitive approximation quality. 

C/3 ' In addition to theoretical guarantees, we do an extensive comparison of the QCQP method and the SDP 

O I relaxation on a variety of synthetic and real world data. Finally, we present two useful extensions: we 

incorporate kernel methods and computing multiple sets of canonical vectors. 

T— I ! 

'■ 1 Introduction 

0^ . Natural phenomena are often the product of several factors interacting. A fundamental challenge of pattern 

' analysis and machine learning is to find the relationships between these factors. Real world datasets are 

04 , often modeled using distributions such as mixtures of Gaussians. These models often capture the uncertainty 

' inherent in both underlying systems and measurements. Canonical correlation analysis (CCA) is a well- 

known statistical technique developed to find the relationships between two sets of random variables. The 
relations or patterns discovered by CCA can be used in two ways. First, they can be used to obtain 
a common representation for both sets of variables. Second, the patterns themselves can be used in an 
^> ' exploratory analysis (see fll| for example). 

^ , It is possible to extend this idea beyond two sets. The problem is then known as the Multi-set Canonical 

Correlation Analysis (MCCA). Whereas it can be shown that CCA can be solved using an (generalized) 
eigenvalue computation, MCCA is a much more difficult problem. One approach is to express it as a non- 
convex quadratically constrained quadratic program (QCQP). In this paper, we show that despite being 
a highly structured problem, it is NP-hard. We then describe an efficient algorithm for finding a locally 
optimal solutions to the problem. 

Since the algorithm is local and the problem non-convex, we cannot guarantee the quality of the solutions 
obtained. Therefore, we give a relaxation of the problem based on semi-definite programming (SDP) which 
gives a constant factor approximation as well as an output sensitive guarantee. 
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For use in practical applications, we describe two important extensions: we adapt the methods to use 
kernels and to find multi-dimensional solutions. 

Finally, we perform extensive experimentation to compare the efficient local algorithm and the SDP 
relaxation on both synthetic and real-world datasets. Here, we show experimentally that the hardness of 
the problem is in some sense generic in low dimensions. That is, a randomly generated problem in low 
dimensions will result in many local maxima which are far from the global optimum. Somewhat surprisingly, 
this does not occur in higher dimensions. 

Our contributions in this paper are as follows: 

• We show that in general MCCA is NP-hard. 

• We describe an scalable and efficient algorithm for finding a locally optimal solution. 

• Using an SDP relaxation of the problem, we can compute a global upper bound on the objective 
function along with various approximation guarantees on solutions based on this relaxations. 

• We describe two extensions which are important for practical applications: a kernel method and 
computing multiple sets of canonical vectors. 

• An extensive experimental evaluation of the respective algorithms: we show that in practice the local 
algorithm performs extremely well, something we can verify with using the SDP relaxation as well as 
show there are cases where the local algorithm is far from the optimal solution. We do this with a 
combination of synthetic and real world examples. 

• We propose a preprocessing step based on random projections, which enables us to apply the SDP 
bounds on large, high dimensional data sets. 

The paper is organized as following. Section [2] describes the background and related work. Section [3] 
introduces the main optimization problem, discusses the problem complexity and presents several bounds 
on optimal solutions. Section 2] describes the extensions of the original formulation to higher-dimensional, 
nonlinear case. Section [5] presents empirical work based on synthetic and real data. Conclusions and future 
work is discussed in section [6l Finally, in the Appendix we included a primer for the notation used in the 
paper. 

2 Background 

Canonical Correlation Analysis (CCA), introduced by Harold Hotelling [16], was developed to detect linear 
relations between two sets of variables. Typical uses of CCA include statistical tests of dependence between 
two random vectors, exploratory analysis on multi-view data, dimensionality reduction and finding a common 
embedding of two random vectors that share mutual information. 

CCA has been generalized in two directions: extending the method to finding nonlinear relations by using 
kernel methods [T7] [H] (see [17] for an introduction to kernel methods) and extending the method to more 
than two sets of variables which was introduced in [18] . Among several proposed generalizations in |18j the 
most notable is the sum of correlations (SUMCOR) generalization and it is the focus of our paper. There 
the result is to project m sets of random variables to m univariate random variables, which are pair-wise 
highly correlated on averag^ll- An iterative method to solve the SUMCOR generalization was proposed in 
|15) and the proof of convergence was established in [5] . In [2] it was shown that there are exponentially many 
solutions to a generic SUMCOR problem. In [ij and [29] some global solution properties were established for 
special families of SUMCOR problems (nonnegative irreducible quadratic form). In our paper we show that 
easily computable necessary and sufficient global optimality conditions are theoretically impossible (which 
follows from NP-hardness of the problem). Since in practice good local solutions can be obtained we will 
present some results on sufficient global optimality. 

^For m univariate random variables, there are (™) pairs on which we measure correlation 
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Figure 1: The bloek strueture of the random vector X and the corresponding covariance block structure. 



We also focus on extensions of the local iterative approach [T5| to make the method practical. Here 
we show how the method can be extended to finding non-linear patterns and finding more than one set of 
canonical variates. Our work is related to [20] where a deflation scheme is used together with the Newton 
method to find several sets of canonical variates. Our nonlinear generalization is related to |28| . where the 
main difference lies in the fact that we kernelized the problem, whereas the authors in |28l worked with 
explicit nonlinear feature representation. 

We now list some applications of the SUMCOR formulation. In [21] an optimization problem for multi- 
subject functional magnetic resonance imaging (fMRI) alignment is proposed, which can be formulated as a 
SUMCOR problem (performing whitening on each set of variables) . Another application of the SUMCOR 
formulation can be found in [20| . where it is used for group blind source separation on fMRI data from 
multiple subjects. An optimization problem equivalent to SUMCOR also arises in control theory |23| in the 
form of linear sensitivity analysis of systems of differential equations. 



3 Sum of correlations optimization problem 

Before stating the problem, we must introduce some notation and context. Assume that we have a random 
vector X distributed over M.^ , which is centered: E {X) ~ 0. Let C := Cov {X , X) denote the covariance 
matrix of X. 

Throughout the paper we will use the block matrix and vector notation. The block structure 

b := (711, . . . , n,„) , ^ 6 (i) = iV 

i 

denotes the number of elements in each of m blocks. Sub-vectors according to the block structure b are 
denoted as X'^'^^ G M"' (i-th block-row of vector X) and sub-matrices as C'^''-''' G M"'^"j (i-th block-row, j-th 
block column of matrix C); see Figure [TJ 
For a vector, w G R^, define 



Z.i is a random variable computed as linear combination of components of A"*^*-* . 
Let p (x, y) denote the correlation coefficient between two random variables, 

^Cov {x,x) Cov {y,y) 
The correlation coefficient between Zi and Zj can be expressed as: 



p{Z,,Zj) = 



Vi(;(Orc(i,i)^i,(i)\/M;0)TCO.j)wO) 
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We are now ready to state the problem. In this paper, we deal with the problem of finding an optimal set 
of vectors w*^*) which maximize 

J2 E Pi^'^'^i) (SUMCOR) 

This is a generalization of Canonical Correlation Analysis where 171 — 2. We refer to this problem as Multi-set 
Canonical Correlation Analysis (MCCA). We refer to each A''*^ as a particular view of some underlying object 
with the assumption is that random vectors A"'*^ share some mutual information (i.e. are not independent). 
The original sum of correlations optimization problem is: 



The solution to the optimization problem, the set of components, . . . , w^™)) , are referred to as the 

set of canonical vectors. Observe that the solution is invariant to scaling (only the direction matters): if 
(w*-^-* , . . . , w'"'^) is a solution, then (ai • w'-^-' w*^™^) is also a solution for ai > 0. This means 
that we have the freedom to impose the constraints u'''-'"'"C^*''-'ti;^*-' = 1, which only affect the norm of the 
solutions. We now arrive to an equivalent constrained problem: 



m m 



maximize w^'^^C*^*'-'^?/;^''^ 

i=l 

subject to u;(*)^C(*^*)wW 1, Vi = l,...,m. 



We proceed by multiplying the objective by 2 and adding a constant m, which does not affect the optimal 
solution. Using the equalities: w^^^-'^C'*'-'-' w*^^-* = w'^^^'^ C'^^ '^^ w'^'^^ and = i -^e arrive at: 



maximize w^'^'^C^"'-'^ w^^^ 

subject to u;(*)^C(''*)w(*) =1, Vi = 1 



This allows us to consider the summation as the quadratic form nFCw. 

Let C^*'*^ is strictly positive definite, the Cholesky decomposition (7'^*'*' — Dj Di exists. Using the 
substitution Wi := DiWi and defining A e such that 

A('J) := Dr'^C^'^^^D-^ 

Here the block structure b is used. As a consequence of the substitution, we have that A'*'*) = /^(i), where / 
denotes the b (i)-by-& (i) identity matrix. Using block vector notation, let , Vi = 1, . . . , TO. 

The form of the optimization problem we will finally consider is: 



max x"'" Ax 



subject to x'^^'^cct'^ =1, Vi = 1, 



(QCQP) 



. ,TO, 



where b := (rii, . . . , n^) encodes the block structure, A e S^, A^*''^ = /f,(i). 

We started with a formulation (|SUMCOR[) and arrived to ( |QCQP[ ). The last optimization problem is 
simpler to manipulate and will be used to prove the complexity result of (jSUMCORj) . as well as to obtain 
a relaxed version of the problem along with some useful bounds. We will also state a local-optimization 
approach to solving (jSUMCORp based on the ( |QCQP[ ) problem formulation. 
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3.1 NP-Hardness 



First, we give a reduction to show that this optimization problem is NP-hard. We use a reduction from a 
general binary quadratic optimization (BQO) problem. 

Let A G R™^™ the binary quadratic optimization (BQO) problem is 

max x"^ Ax 

(BQO) 

subject to X (i) =1, Vi = l,...,m 

Many difficult combinatorial optimization problems (for example maximum cut problem and maximum 
clique problem [5]) can be reduced to BQO [3], which is known to be NP-hard. 

We will show that each BQO problem instance can be reduced to an instance of the problem ( |QCQP[ ). 
That means that even though the problem ( |QCQP[ ) has special structure (maximizing a positive-definite 
quadratic form over a product of spheres) it still falls into the class of problems that are hard (under the 
assumption that P ^ NP). The idea is to start with a general instance of BQO and through a set of simple 
transformations obtain a specific instance of ( |QCQP[ ), with a block structure b ~ (1,...,1). The simple 
transformations will transform the BQO matrix A into a correlation matrix, where the optimal solutions 
will be preserved. 

Let us start a BQO with a corresponding general matrix A e K™-. Since x^Ax = x^i^x we 
can assume that the matrix A is symmetric. The binary constraints imply that for any diagonal matrix D 
the quantity x^ Dx ~ D is constant. This means that for c > large enough, we can replace the 
objective with an equivalent objective x'^ {A + c ■ I) x which is a positive-definite quadratic form. If wc set c 
to ll^lli -l- 1, it guarantees strong diagonal dominance, a sufficient condition for positive definiteness. From 
now on, we assume that the matrix A in the BQO is symmetric and positive-definite. Let g — max^ A {i, i) 
and let D e K™^™ be a diagonal matrix with elements D ^ g — A 

Then the BQO is equivalent to: 

t{A + D) 
max X x 

xm^ g (3) 

subject to X (i)^ = 1, Vi = 1, . . . , m. 

The matrix ''^"^'^^ is a correlation matrix since it is a symmetric positive-definite matrix with all diagonal 
entries equal to 1. The optimization problem corresponds to a problem of maximizing a sum of pairwise 
correlations between univariate random variables (using block structure notation: b{i) = l,Vi ~ 1, . . . ,m). 
This shows that even the simple case of maximizing the sum of correlations, where the optimal axes are 
known and only directions need to be determined, is a NP-hard problem. 



3.2 Local solutions 

Given that the problem is NP-hard and assuming P ^ NP, it is natural to use local methods to obtain a 
perhaps suboptimal solution. In this section, wc give an algorithm that provably converges to locally optimal 
solutions of the problem ( |QCQP[ ), when the involved matrix A is symmetric and positive-definite and generic 
(seei). 

The general iterative procedure is given as Algorithm [TJ 

The algorithm can be interpreted as a generalization of the power iteration method (also known as the 
Von Mises iteration), a classical approach to finding the largest solutions to the eigenvalue problem Ax — Xx. 
If the number of views m = 1, then Algorithm [T] exactly corresponds to the power iteration. Although the 
algorithm's convergence is guaranteed under the assumptions mentioned above, the convergence rate is not 
known. In practice we observe linear convergence, which we demonstrate in figures I2a[ I2bl In figure [5a| we 
generated 1000 randorrH instances of matrices A with block structure b = (2, 2, 2, 2, 2) and for each matrix 

^We used random Gram matrix method to generate random problem instances; the method is described in section [5. II 
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Algorithm 1 Horst algorithm 

Input: matrix A e block structure b = (ni, . . . , n^), initial vector .tq G with > 0, 

X ^ xq; 

for iter" = 1 to maxiter do 
X 4— Ax; 

for i = 1 to m do 



end for 
end for 
Output: X 



we generated a starting point xq and ran the algorithm. The figure depicts the solution change rate on a 
logarithmic scale {logio ^^'^'jj!^^'^^^ )■ Wc observe linear convergence with a wide range of rates of convergence 
(slopes of the lines) . Figure I2bl shows the convergence properties for a fixed matrix A with several random 
initial vectors xq- The problem exhibits a global (reached for 65% initial vectors xq) and a local solution 
(reached in 35% cases), where the global solution paths converge faster (average global solution path slope 
equals —0.08, as opposed to —0.05 for the local solution paths). 





Figure 2: (a) Convergence plot (1000 random matrices A, one random xq per problem instance) (b) Con- 
vergence plot (single random A, 1000 random initial vectors xq) 



3.3 Global analysis 

The above algorithm is highly scalable and as we shall see in Section [5] often works well in practice. How- 
ever the algorithm may not converge to a globally optimal solution. We will first present the semidefinite 
programming relaxation of the problem in 13.3.11 There we will show how the SDP solution can be used 
to extract solution candidates for the original problem. We will prove a bound that relates the extracted 
solution quality and the optimal QCQP objective value. 

Wc will then present a set of upper bounds on the optimal QCQP objective value 13.3.21 Such bounds 
can be used as certificates of optimality (or closeness to optimality) of local solutions (obtained by the local 
iterative approach for example). 
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3.3.1 Semidefinite programming relaxation 

SDP Relaxation Let A,Bi,..., B„i G R^^^ share the block structure b := (ni, . . . , n„J , J2t b («) = N. 
The blocks sf'''^ e M"'-><"' for i, fc, / = 1, . . . , m are defined as: 

^{kd) f IfH --k = i,l 
* ■ \ Okj : e/se 

where 1^ & ig an identity matrix and 0^^; S ffi*^'' is a matrix with all entries equal to zero. 

Using the fact that x^Ax = trace (Axx"^) we can rewrite the problem ( |QCQP[ ) as: 



max trace (A. 



XX 

^ ' (QCQP2) 
subject to trace (^Bixx^) =1, Vi = 1, . . . , m 

We can substitute the matrix xx'^ with a general matrix X G §" constrained it to being rank-one: 

maximize trace (AX) 

subject to trace {BiX) = 1, Vi = 1, . . . , m 
rank(X) = 1. 

Remark Matrices A and Bi, . . . , B,n are symmetric positive-semidefinite matrices. 
By omitting the rank-one constraint we obtain a semi-definite program in standard form: 

max trace {AX) 

subject to trace (-BiX) = 1, Vi = 1, 



Remark If the solution of the problem (|SDP[) is rank-one, i.e. X can be expressed duS X = y ■ y'^ , then y is 
the optimal solution for ( |QCQP[ ). 

Low rank solutions In the following subsection we will show how to extract solutions to QCQP from 
solutions of the SDP problem. We will present a bound that relates the global SDP bound, the quality of 
the extracted solution and the optimal value of QCQP. The bound will tell us how the extracted solution 
gets close to the optimal QCQP solution when the SDP solution is close to rank 1. 

Let X* be a solution to the problem (|SDPp and let x* be the solution to the problem ( |QCQP[ ). Then 
the following inequality always holds: 



trace {AX*) > trace (A ■ 



X ■ X 



An easy way to extract a good feasible solution to the problem ( |QCQP[ ) from X* is to project its leading 
eigenvector to the set of constraints. Let b — {rii, . . . , n™) , J^i ~ N denote the block structure. 

Let y g R^, ^ 0. The projection of vector y to the feasible set of the problem ( |QCQP[ ) is given by 

map TT (•) : R^, defined as: 



71" (y) := 



y(l) y(™) 



The quality of the solution depends on spectral properties of matrix X and matrix A. 

Assumption 3.1 Let b = (ni, . . . , rim) denote the block structure and Hi ~ N . Let X* be the solution 
to the problem ^SDP\) . Let Xk denote the k-th eigenvector of X* . The assumption is the following: 

\\xf^\\ > 0,Vi = l,...,m. 
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Conjecture 3.2 Assumvtion \3.1\ holds in general for optimal solutions to the problem IISDP\) . 

The assumption makes the projection to the feasible set, 7r(-), weh defined. In our experiments, this was 
always true, but we have been unable to find a proof. The result that we will state after the next lemma is 
based on the projection operator and thus relies on the assumption. 

Lemma 3.3 Let b = (tii, . . . , denote the block structure and — solution to 

problem IISDP\) . Let Xk denote the k-th eigenvector of X* . Let ai := y^tjz- If X* can be expressed as: 

X* = Xixix^ + X2X2X2 , 
where Xi and X2 have unit length and Ai > 1 > A2, then 

Ai < aiaj < — . 

i — A2 

Proof The constraints in problem (jSDP[) are equivalent to: 

Ailkf^f + A2||4''^|P = 1,VZ=1,...,TO. 

Since A2 < 1 and ||4*^ll^ < 1 it follows that < A2||a:2'^ll^ < 1- ^ follows that: 



Ai Ai 



Since ai = -hj it follows that 



and finally: 



I-X2 

Ai < aiaj < ^\ ,Vi,j = 1, . 

i — At 



Proposition 3.4 Let b = (ni, . . . ,nm) denote the block structure and Hi = N. Let X* be the solution 

to the problem (SDP\) and x* be the solution to the problem \QCQF^ . Let Xk denote the k-th eigenvector of 

X* . Let ai := — pj— . Let i/j :— trace (AX*), (f) :~ trace (A • x* ■ x*^^ . It is obvious that: 
W^i II 

?/>>(/)> TT (xi) . 

If X* can be expressed as: 

X* ~ Xixix^ + X2X2X2 , 
where xi and X2 have unit length and Ai > 1 > A2, then: 

^^-7t{x,)< f-l^-l\m'- + X2\\A\\2. 
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Proof 



id 

-^(Ai • +A2IIAII2 < 

Al) • V^+A2P||2< 



< 



< 



< 



Ai 

I-A2 

Ai 

I-A2 
1 

1- A2 



-Aij ._ + A2l|Al|2 

- 1 ) ■m^ + X2\\A\\2 



A similar bound can be derived for the general case, provided that the solution to problem (|SDPp is close 
to rank-one. 

Proposition 3.5 Let b = (ni, . . . ,nm) denote the block structure and Ui = A^. Let X* be the solution 
to the problem liSDP\} and x* be the solution to the problem liQCQP\ ). Let x^ denote the k-th eigenvector of 

X* . Let ai :— — rjj— . Let ib := trace (ylX*). If X* can be expressed as: 

W^i II 

X* = Aixixf + A2a;2.Tf H + A„a;„a;^, 

where each Xi has unit length and Ai > 1 > ^ A^, then: 

i—2 n 



Ip -TT (xi) 
I 



< 



1- E A, 

i—2,. ...n 



I i=2,...,n 



3.3.2 Upper bounds on QCQP 

This subsection will present several upper bounds on the optimal QCQP objective value. We will state 
a simple upper bound based on the spectral properties of the QCQP matrix A. Wc will then bound the 
possible values of the SDP solutions and present two constant relative accuracy bounds. 
L2 norm bound We will present an upper bound on the objective of ( |QCQP[ ) based on the largest eigenvalue 
of the problem matrix A. 

Proposition 3.6 The objective value of ^QCQP\ ) is upper bounded by m ■ ||j4||2. 

Proof The problem ( |QCQPp remains the same if we add a redundant constraint x'^x = m obtained by 
summing the constraints E"=i (x''^^^ x''^^ ~ l) =0- We then relax the problem by dropping the original 
constraints to get: 

max x^ Ax 

subject to x'^ X = m. 

Since ||^||2 = max^^^^^^^ix'^ Ax it follows that the optimal objective value of (|4]) equals m ■ \\A\\2- | 
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Bound on possible SDP objective values 

Lemma 3.7 Let X* be the solution to the problem HSDF]) and let ip trace (AX*). Then 

m < ip < . 

Proof Express X* as: 

X* — ^ ^ XiXixf , 

2— l,...,n 

where each Xi has unit length and Ai > . . . > Xn > 0. The lower bound follows from the fact that upper 
bounds the optimal objective value of problem ( |QCQP[ ) which is lower bounded by m. The lower bound 
corresponds to the case of zero sum of correlations. 

To prove the upper bound first observe that the constraints in (jSDPp imply that ^ A; — m. Let 

2— l,...,n 

y e and let ||?;||2 = 1. Let z := . . . , . Observe that ||z||2 = 1 and that ||zz^||2 = 1. 

Define e € M™, e (i) = 1, Vi = 1, . . . , m. We will now bound ||A||2: 

y^Ay = 2/(^^^A(^J)2/(j") 



— l,...,m 



^ < 



J2 \\y'''\\\\y 



< \ ll„(0||||„(i) 



e (zz )e < ||e|| • \\zz-^ ||2 • |le|| = m. 



We used the fact that i^/iTj^''*'"''' Y^UTJ ^ correlation coefficient and thus bounded by 1. The upper bound 
follows: 



trace (AX*) = trace j A E] XiX'^''x 



2— 



Constant relative accuracy guarantee We now state a lower bound on the ratio between the objective 
values of the original and the relaxed problem that is independent on the problem dimension. The bound 
is based on the following result from [22] . stated with minor differences in notation. Let Square(-) denote 
componentwise squaring: if y = Square(a;) then y{i) = x{if' and let diag(X) denote the vector corresponding 
to the diagonal of the matrix X . 

Theorem 3.8 Let A G M^^^ be symmetric and let T be a set with the following properties: 

• J- is closed, convex and bounded. 

• There exists a strictly positive v ^ J- . 

• J- ^ {v ^ K : Bv = c} , where K is a convex closed pointed cone in with non-empty interior, 

pkxN 



B e W''^ , c ^ Ofc and {v e intK : Bv ^ c} ^ ' 
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Let 

(j)* :— max Ax : square (a;) G J^} , 
:= min \^x^Ax : square (x) G J^} , 
jp* := max {trace (AX) : diag (X) e J", X G 
ip., := min {trace (^X) : diag (X) e J", X e §+ } , 
•0 (a) := aip* + (1 — a) ?/'*■ 

V'* < 0* < V' ( 1 - - ) < 'Z' ( - ) < 0* < 0*- 



Theorem 3.9 Let x* be the solution to the problem \QCQP^ and X* be the solution to the problem \SDP\) . 
Let b = (tii, . . . , n„i) denote the block structure where X^i"* ^ ^* trace (A • a;* -a;*"^), ip* := 

trace {AX*). 
Then 

2 

--0* <</>*< 1p*- 



TT 



Proof We first note that tp^, > 0, since A e S!}:. This foUows from the fact tliat trace (AX) > for any 
X G (and thus for the minimizer X^,). The positiveness of the trace can be deduced from: trace {AX) ~ 
trace {CaC^CxC^) = trace (C^CaC^Cx) = WC^CxWl- > 0, where Ca and Cx are Cholesky factors of 
matrices A and X respectively. 

We now show that the problems ( |QCQP2D and (ISPP]) can be reformulated so that the theorem 13.81 
applies. 

First wc note that the feasible sets in ( |QCQP2] ) and (jSDPp are defined in terms of equalities. Without loss 
of generality we can replace them with inequality constraints: x'-^'-^x^'^ < 1 in ( |QCQP2] ) and trace {BiX) < 1 
in (|SDPp . The feasible sets defined by the inequalities are convex and bounded. Since the objective functions 
in both problems are convex, it follows that the optima lie on the border. 

Next, we add redundant constraints to the two problems respectively: Square (a;^*^) > 0, Vi = 1, . . . , m 
and X(j,j) > 0,Vj = l,...,iV. 

Define T = {x G R^\x^^^ € A"-"!}, where 



A''^ = ja; e M''^+i|a;(i) > 0,Vi and ^a;(i) = l| 



is a product of standard simplices: = YViLi ^- It follows that the set is closed, bounded and convex. 
J- can be embedded in M^+i in order to obtain a conic formulation. 

K = {t- [1 x^]^\t>0,xeT} 
B=[10n]^, c=1, J- = Kn{x\Bx^c}. 

Define v = [vf . . . v'^]'^ , where vi {j) = The vector [l f"^] is strictly positive and lies in int {K) n 

{x e M^+i|Ba: = c}. Let 1 G M^+i be defined as A (1, i) = 0, A{i, 1) = 0, Vi and l(i, j) = A{i-l,j- 1) ,Vi,j > 
1. 

The optimization problem ( |QCQP2| is equivalent (with the same optimal objective value) to: 



subject to Square {x) G F 



max trace 



11 



a = - we ffet the desired result: 

7r ^ 



The optimization problem (|SDP[) is likewise equivalent to the problem: 

max trace ( AX ] 

subject to diag {X) g F 

Using the definition of ip (a) and the fact that ip^, > it follows that i/j (a) > aip* ,ya > 0. Substituting 

2 * 

—ip < (j) < ip . 

TT 



Observe that the bound above relates the optimization problems ( |QCQP[ ) and (jSDPp and not ([T|) with 
its SDP relaxation. Let (p denote the optimum value of the objective function in ([T]) and let ip denote 
the optimum value of the objective function of the corresponding SDP relaxation. It is easy to see that 
2 ■ (p + m = (p and 2 ■ ijj + m = ip, which is a consequence of transformations of the original problems to 
their equivalent symmetric positive-definite problems. The constant relative accuracy bound becomes a 
bit weaker in terms of the original problem and its relaxation. This fact is stated in the following corollary. 

Corollary 3.10 

~ 2~ (l-^)m 

TT I 



Improved bound on the relative accuracy We can exploit additional structure of the problem to obtain 
a slightly better bound. We use the same conventions as [55] . 

Define 

u (/?) 13 arcsin { jS) + ^/l - l3^. 

The function w (/3) is increasing and convex with a; (0) = 1 and w (1) = ^. 
By theorem 3.1, item 1 in [55] we obtain the result: 

(2 f m\ 
max|-w f — j ,— >4'*<(p*<ip*. 

This results in a minor improvement of the default bound. For example when m = 3 and the fact that 
^ > I we obtain the following: 

2 ,* 105 2 ,^ 

<jTr^--lp <(p <lp 
TT 100 TT 

4 Sum of correlations extensions 

In this section we discuss two extensions of MCCA. By using kernel methods we show how to find nonlinear 
dependencies in the data. We then present an extension of the method to finding more then one set of 
correlation vectors. 



4.1 Dual representation and kernels 

We return to the formulation ([1}: 



max y y u;(*)^C(*'^')«;(^') 

2 — 1 J— 2+1 

subject to u;W^C(''*)w« = 1, Vi = l,...,m, 
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where b = (ni, . . . , n„i) denotes the block structure and J2i = ^- ^he previous sections we focused on 
manipulating covariance matrices only and omitted details on their estimation based on finite samples. In 
this section we will use a formulation that explicitly presents the empirical estimates of covariances, which 
will enable us to apply kernel methods. Let X he a, random vector distributed over with E {X) = 0. 
Let X G R^^* represent a sample of s observations of X, where each observation corresponds to a column 
vector. Empirical covariance of X based on the sample matrix X is expressed as: 

1 



Cov (X) = XX^ . 

s — 1 

In case the number of number of observations, s, is smaller than the total number of dimensions A^, the 
covariance matrix Cov {X) is singular. This is problematic both from a numerical point of view and it leads 
to ovcrfitting problems. These issues are addressed by using rcgularization techniques, typically a shrinkage 
estimator Cov {X)^ is defined as: 



1 



Gov {X)^ = (1 - k) XX^ + kI 



N, 



where k e [0, 1]. 

Using the block structure b, ([2]) becomes: 



mm 
* i=l j=i+l 

subject to + ,«/^^ = 1, 

Vz = 1, . . . , m. 

We will now express each component w'-'-' in terms the columns of X'-^\ Let w have block structure 
bw = (,ni, . . . , Um) where rii = iV, and let y g M™'^ have block structure by (i) = s, Vi = 1, . . . , m. 



max 



= > ;y«(j)X«(:,j)=xWy«, (6) 



We refer to y as dual variables. 

Lemma 4.1 Solutions to the problem ([5]) can be expressed as 

Proof We will prove this by contradiction. Let u be the optimal solution to ([S]). Assume that u^^'' doesn't 
lie in the column space of 

where 

z_L ^ On, and X'^^'>'^zi_ = 0^. 
Then li defined as w^*^ = u^'^\\/i > 1 and u'-'^ = ^X'-'^^y^-'^', where 



strictly increases the objective function, which contradicts u being optimal. Clearly u is a feasible solution. 
Positive definiteness of i^X^^^X*^^'^ + kI^ coupled with the fact that zjzj^ > implies that < 7 < 1. 

Assume without loss of generality that X]jL2 {X'-^^y^^'^)'^ X'^^^X'^^'^'^u'-^'^ > (The negative sum would lead 
to another contradiction by taking u*-^-* = —u^^\ If the sum was zero, then any properly scaled (with 
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proper sign) combination of the training data X*^^) could be used in place of w*^^^). The following inequality 
completes the proof: 



-.171 „ 



s 

J=2 



Let /Vj = e R"^^ denote the Gram matrix. We now state regularized covariance formulation 

in terms of the dual variables: 



-. mm 



max 



i—1 

subject to y(')^ ( ^—^K,Kf + kK^ y^*' = 1, '''^^ 



s - 1 
Vi = 1, . . . , m. 

The problem is reformulated in terms of Gram matrices based on the standard inner product. This 
formulation lends itself to using kernel methods (see [27]) which enable discovering nonlinear patterns in the 
data. 

Typically the matrices Ki are ill conditioned (even singular when the data is centered) and it is advan- 
tageous to constrain the magnitude of dual coefficients as well as the variance in the original problem. We 
address this by introducing a first order approximation to the dual regularized variance. Let 



Then: 

Gov (A-W)^ = ^^KJ<J + nK, « %Ki^ . 

/A. S — 1 

The approximation that has two advantages: it is invertible and factorized, which we exploit in obtaining a 
convergent local method. The final optimization is then expressed as: 

max J—yy y(^)Tj^^j^T U) 

subject to y^^'>'^K,K-^y^'''> ==1, Vi = 1, . . . , m. 

The problem can be interpreted as maximizing covariance while constraining variance and magnitude of dual 
coefficients. 
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4.2 Computing several sets of canonical vectors 

Usually a one-dimensional representation does not sufficiently capture all the information in the data and 
higher dimensional subspaces are needed. After computing the first set of primal canonical vectors we 
proceed to computing the next set. The next set should be almost as highly correlated as the first one, but 
essentially "different" from the first one. We will achieve this by imposing additional constraints for every 
view, namely that all projection vectors in view i are uncorrelated with respect to Kf (similar as in two view 
regularized kernel CCA, see [17]). 

Let Y = [yi, . . . ,yk] S represent k sets of canonical vectors, where 

yWT^2yW^j^V^^l,...,m. 

The equation above states that each canonical vector has unit regularized variance and that different canon- 
ical vectors corresponding to the same view are uncorrelated (orthogonal with respect to Kf). 
We will now extend the set of constraints in the optimization (|5]) to enforce the orthogonality. 



-. rn m 



s — _ 

i—l j— ?+l 
T 



subject to y^^'^KiK, y^"^ =1, Vi = 1, 



(9) 



In order to use the Horst algorithm, we first use substitutions: 
Wc then define operators 

P, = Is- K,Y^'^Y^'^^K, = h - 

which map to the space orthogonal to the columns of KiY^'^K Each Pi is a projection operator: = Pi, 
which follows directly from the identities above. We restate the optimization problem in the new variables: 



. mm 

max y z^^^^K.r'^K^KjKf'rU^ 

subject to z(')^z(') = 1, Vi = 1, . . . , m 



^WT^W^Ofc, Vi = l,...,m. 
By using the projection operators, the optimization problem is equivalent to: 



m m 

max -J—yy z^^Tp^K, K,KjK~^P,z(=^ 



i—l 

s.t. z(*)^z(*) = 1, Vi = l,...,m. 

By multiplying the objective by 2 (due to symmetries of Pi, Ki and Ki) and shifting the objective function 
by Yz^i the problem is equivalent to: 



mm 

max -J—yy z^'^^P^K~^K,KjK~'Pz^^^ 

i=l j = l 



(11) 



1 ^ 

i=i 

s.t. z'')^z«=l, Vi = l,...,m 



15 



This last optimization can be reformulated as: 

max z'^ Az 

subject to z^'^'^z^'^ = 1, Vi m, 
where A £ R'"'* with block structure 6 (i) = s, Vi = 1, . . . , m, defined by: 



(12) 



^(.j)^J T^PfK, K^KjK, P, ioTi^j I 
I T^^^ forz-j J 

Lemma 4.2 The block matrix A defined above is positive semidefinite (i.e. A £ 

Proof A is symmetric, which follows from Pi = P/" and Ki = . Let z £ E™ '*. We will show that z^ Az > 
0. Let W ^ J2Zi z^'^^PfK.'^ {kK^ + "l^'I^^ K^^P^z^'\ Note that W > (each summand is 
positive-semidefinite) and W > if 3i : Piz'^^^ = z^'^\ 

mm 

z'Az = ^S^S^ z^^^^pTk^ K^K^K^P.z^^ 



^Az=^j:;^z^^^PjKr^K.K]K-'p, 

1=1 j=i 

-J m 

1 - K ^ ~ 

1=1 

mm 

1 hi 

1=1 

= Aee-^^^"^:'^"'^^^j^"^.-' 

j=i 

1=1 j=i 

^ ^ z^^^PlK' K.KjK-'p.z'^^ + 



s — _ 

1=1 

z^BB^z + iy > 0, 



where B £ M"' '"^^, defined by B^'^^ = —^l=(KiKi Pi)^ , with corresponding row block structure 6 (i) = 

s, Vi = 1, . . . , m. If ^ for some i, then the first inequality is strict (||i-iz'^')|j < Conversely, 

if Piz'-^^ = z^*' for all i, then W > 0, hence the last inequality is strict. | 

Since matrix A has all the required properties for convergence, we can apply the Algorithm [TJ Solutions 
to the problem (jOj are obtained by back-substitution y*^*) = Ki z^'^\ 
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4.3 Implementation 

The algorithm involves matrix vector multiplications and inverted matrix vector multiplications. If kernel 
matrices are products of sparse matrices: Ki = X^'^'^'^ X^^'^ with X'^^^ having sn elements where s « n, then 
kernel matrix vector multiplications cost 2ns instead of rP' . We omit computing the full inverses and rather 
solve the system KiX = y for x, every time K~^y is needed. Since regularized kernels are symmetric and 
multiplying them with vectors is fast (roughly four times slower as multiplying with original sparse matrices 
X'*^), an iterative method like conjugate gradient (CG) is suitable. Higher regularization parameters increase 
the condition number of each Ki which speeds up CG convergence. 

If we fix the number of iterations, maxiter, and number of CG steps, C, the computational cost of 
computing a /c-dimensional representation is upper bounded by: 0(C • maxiter ■ k"^ -m-n - s) , where m is the 
number of views, n the number of observations and s average number of nonzero features of each observation. 
Since the majority of computations is focused on sparse matrix-vector multiplications, the algorithm can 
easily be implemented in parallel (sparse matrices arc fixed and can be split into multiple blocks). 

So far we have assumed that the data is centered. Centering can efficiently be implemented on the fly 
with no changes in asymptotic computational complexity, but we will omit the technical details due to space 
constraints. 

5 Experiments 

We evaluated the SDP approach on two scenarios: performance analysis on synthetic data and performance 
on finding a common representation of a cross-lingual collection of documents. 

5.1 Synthetic data 

We generated several MCCA problem instances by varying the number of views and number the of dimensions 
per view in order to compare the performance of local search methods and the proposed SDP relaxation. 
The main purpose of the experiments was to sec under which conditions and how likely do the global bounds 
provide useful information. 

Let m denote the number of views (sets of variables) and rii denote the dimensionality of i-th view and 
N := J2i '^j- In s-ll cases we used the same number of dimensions per view {ni = n2 = ■ ■ ■ = n„i). We used 
three different methods to generate random correlation matrices. 

The first method, the random Gram matrices (see [13], [1]) , generates the correlation matrices by 
sampling N vectors vi, . . . ,Vn for a TV-dimensional multivariate Gaussian distribution (centered at the origin, 
with an identity covariance matrix), normalizing them and computing the correlation matrix C = [ci.jl^xAf 
as Cij :— v[ ■ Vj. The second method, the random spectrum, involves sampling the eigenvalues Ai, . . . , Ajv 
uniformly from a simplex (X]i!=i — ^) ^^'^ generating a random correlation matrix with the prescribed 
eigenvalues (see [1]). The final method, random 1-dim structures, involves generating a correlation 
matrix that has an approximately (due to noise) single-dimensional correlation structure. Here we generated 
a random m dimensional Gram matrix B, and inserted it into a. N x N identity matrix according to 
the block structure to obtain a matrix Cq (Set Co (hj) = ^ihj), where 5 is the Kronecker delta. Then 

for I, J = 1 . . . , m override the entries Co (^l + J2i=i '^ii 1 + J2i=i '^i) = ^ i^^ J)^ where we used 1-based 
indexing). We then generated a random Gram matrix D G M^^^ and computed the final correlation matrix 
as C = (1 — e) Co + e-D. In our experiments we set e = 0.001. The purpose of using a random spectrum 
method is that as the dimensionality increases, random vectors tend to be orthogonal, hence the experiments 
based on random Gram matrices might be less informative. As we will see later, the local method suffers the 
most when all = 1, which is an instance of a BQO problem. By using the approximately 1-dimensional 
correlation matrix sampling we investigated if the problem persists when Ui > 1. 

In all cases we perform a final step that involves computing the per-view Cholesky decompositions of 
variances and change of basis as we showed when we arrived to QCQP reformulation in equation ( |QCQP[ ). 
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Tabic 1: Random Gram matrix sampling 



(a) Possible duality gap 





Ui — 3 


ni — 2 


rii — 1 


m — 5 


0% 


5% 


17% 


m = 3 


0% 


0% 


9% 


(b) Local convergence 




m = 3 




Ui = 1 


m — 5 


1% 


5% 


48% 


m = 3 


0% 


1% 


26% 



(c) Local solution below lower SDP bound 





m = 3 


rii — 2 


Ui — 1 


m — 5 


0% 


0% 


14% 


m — 3 


0% 


0% 


12% 



The experiments are based on varying the number of sets of variables, to, and the dimensionality n^. 
For each sampling scenario and each choice of to and n,;, we generated 100 experiments, and computed 
1000 solutions based on Algorithm [1] the SDP solution (and respective global bounds), and looked at the 
frequencies of the following events: 

• a duality gap candidate detected (Tables[l] [5J [3] (a)), 

• local convergence detected (Tables [U [51 [21(b)) 

• when a local solution is worse than the SDP-based lower bound (Tables [I] [51 [31 (c)). 

The possibility of duality gap is detected when the best local solution is lower than 1% of the SDP bound. 
In this case the event indicates only the possibility of duality gap - it might be the case that further local 
algorithm restarts would close the gap. The local convergence event is detected when the objective value of 
two local solutions differs relatively by at least 10% and absolutely at least by 0.1 (both critcrions have to 
be satisfied simultaneously). Finally, the event of local solution being bellow the SDP lower bound means 
that it is bellow ^ of the optimal objective value of the SDP relaxation. 

We find that regardless of how we generated the data, the lower SDP bound is useful only when = 1 
(Table [l] [21 [3] (c)) and the results are similar for different choices of to. There are, however rare cases (less 
than 0.1%) where the lower bound is useful for rii = 2 and even rarer (less than 0.01%) for rii = 3. 

The chance of local convergence increases as the number of views to increases which can be consistently 
observed for all choices of rii and sampling strategies. Generating a problem where the local algorithm 
likely converges to a local solution is less likely as the dimensionality increases in the generic case (Tables [1] 
[3]). In the case of noisy embeddings of 1-dimcnsional correlation structures the dependence on rii behaves 
differently: the local convergence (see Table I3bp for the case (to = 5, = 3) is more likely than for the 
case (to = 5,71.; = 2), which is a curiosity (in the general case, increasing rii reduces that chance of local 
convergence, see Table [2bl Table [lb| . 

The relationship between to and rii and the possibility of duality gap behaves similarly as the local 
convergence - increasing to increases it and increasing rii decreases it (Table [Tal Table [5a]), except in the case 
of noisy 1-dim correlation structures, where we observe the same anomaly when = 2 (Table [3a]). 

To summarize, we illustrated the influence of to and rii on the performance of the Algorithm [1] and 
demonstrated that there exist sets of problems with nonzero measure where the SDP bounds give useful 
information. 
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Table 2: Random spectrum sampling 



(a) Possible duality gap 





Ui — 3 


ni — 2 


rii — 1 


m — 5 


0% 


5% 


36% 


m = 3 


0% 


1% 


20% 


(b) Local convergence 




m = 3 


«^ = 2 


Ui = 1 


m — 5 


1% 


3% 


50% 


m = 3 


0% 


0% 


31% 



(c) Local solution below lower SDP bound 





m = 3 


rii — 2 


Ui — 1 


m — 5 


0% 


0% 


15% 


m — 3 


0% 


0% 


16% 



Table 3: Random 1-dim structure sampling 
(a) Possible duality gap 





Hi — 3 


rii — 2 


Ui — 1 


m — 5 


24% 


16% 


23% 


m = 3 


7% 


4% 


7% 


(b) Local convergence 




Hi — 3 


Tlj = 2 


= 1 


m — 5 


9% 


6% 


51% 


m = 3 


0% 


0% 


31% 



(c) Local solution below lower SDP bound 





Ui — 3 


= 2 


Ui — 1 


m — 5 


0% 


0% 


13% 


m — 3 


0% 


0% 


15% 



5.2 Multilingual document collection 

Applications of canonical correlation analysis on collections of documents have been demonstrated in dimen- 
sionality reduction, cross-lingual document retrieval and classification [6] [5], extracting multilingual topics 
from text |25j , detecting bias in news [7] . In this section we explore the behavior of Algorithm [1] with re- 
spect to the global bounds. We will start by describing the data and then describe a method to reduce the 
dimensionality of the data in order to apply the SDP bounds. 

Data set and preprocessing Experiments were conducted on a subset of EuroParl, Release v3, [19], a 
multilingual parallel corpus, where our subset includes Danish, German, English, Spanish, Italian, Dutch, 
Portuguese and Swedish language. We first removed all documents that had one translation or more missing. 
Documents (each document is a day of sessions of the parliament) were then arranged alphabetically and 
split into smaller documents, such that each speaker intervention represented a separate document. We 
removed trivial entries (missing translation) and after that removed all documents that were not present in 
all eight languages. 

Thus we ended up with 12,000 documents per language. They roughly correspond to all talks between 
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2.25.1999 and 3.25.1999. We then computed the bag of words (vector space) [25] model for each language, 
where we kept all uni-grams, bi-grams and tri-grams that occurred more than thirty times. For example: 
" Mr" , " President" and " Mr .President" all occurred more than thirty times in the English part of the corpus 
and they each represent a dimension in the vector space. This resulted in feature spaces with dimension- 
ality ranging from 50,000 (English) to 150,000 (German). Finally we computed the tf-idf weighting and 
normalized every document for each language. We described how wc obtained corpus matrices X^"^^ for each 
language, where all the matrices have 12, 000 columns and the columns are aligned (X(^) {:,€) and X^^') (:,€) 
are a translation of each other). In section |4] we showed how to derive the QCQP problem, given a set of 
input matrices X^^\ 

Random projections and multivariate regression Applying the relaxation techniques to covariance 
matrices arising from text presents a scalability problem, since both the number of features (words in 
vocabulary) and number of documents can be large. Typical SDP solvers can find solutions to relaxed 
forms of QCQPs with up to a few thousand original variables. We now propose a method to address this 
issue. The main goal is to reduce the dimensionality of the feature vectors which results in tractable SDP 
problem dimensions. One way to analyze a monolingual document collection is to perform singular value 
decomposition on the corpus matrix, a technique referred to as Latent Semantic Indexing (LSI) [3]. A set 
of largest singular vectors can be used as a new document basis for dimensionality reduction. Expressing 
the documents with the basis of k largest singular vectors is optimal with respect to Frobenious norm 
reconstruction error. If computing the basis is too expensive, one can generate a random larger set of basis 
vectors that achieve similar reconstruction errors, a technique referred to as random projections. Although 
the random projection basis is not informative in the sense that LSI basis is (topics extracted by LSI reflect 
which topics are relevant for the corpus, as opposed to random topics), they can both achieve comparable 
compression qualities. 

A variant of LSI for multiple languages, Cross-Lingual LSI (CL-LSI)[4], first joins all feature spaces thus 
obtaining a single multilingual corpus matrix (single language corpus matrices are stacked together) . CL-LSI 
then proceeds as standard LSI by computing the singular value decomposition of the multilingual corpus 
matrix. Applying random projections instead of SVD does not work directly; random multilingual topic 
vectors destroy cross lingual information: a fact which can be observed experimentally. 

Our approach is based on the following idea. Generate a set of random vectors for one language and 
use Canonical Correlation Analysis Regression (CCAR)[24] (a method similar to ridge regression) to find 
their representatives in the other languages. Repeat the procedure for each of the remaining languages to 
prevent bias to a single language. We hypothesize that restricting our search in the spaces spanned by the 
constructed bases still leads to good solutions. The procedure is detailed in Algorithm O 

Let m be the number of vector spaces corresponding to different languages and rii the dimensionality of 
the i — th vector space. Let X*^'^ G ]g"ixAr represent the aligned document matrix for the i-th language. 

The matrices . ■ . , P(i,m) form the bases of vector spaces corresponding to X'^', . . . , AT^™). Let 

Pi := . . . ,P(m_i)] denote the full basis for the i-th language. We now experimentally address two 

questions: docs the restricted space enable us to find stable patterns and what do the SDP bounds tell us. 
Experimental protocol The experiments were conducted on the set of five EuroParl languages: English, 
Spanish, German, Italian and Dutch. Wc set fc = 10 which corresponds to Ui = 50 dimensions per view, 
so the QCQP matrix will be of size 250 x 250. Wc randomly select 5000 training documents and 1000 
test documents. For a range of random projection regularization parameters 7, we compute the mappings 
Pi (based on the train set) and reduce the dimensionality of the train and test sets. Then, for a range of 
QCQP regularization parameters k, we set up the QCQP problem, compute 1000 local solutions (by Horst 
algorithm) and solve the SDP relaxation. The whole procedure is repeated 10 times. 

For each (7, k) pair we measured the sum of correlations on the test and train sets. In Table |4] we report 
sums of correlations averaged over 10 experimental trials. The maximal possible sum of correlations for five 
datasets equals to (2) = 10. We observe that regularizing the whole optimization problem is not as vital as 
regularizing the construction of random projection vectors. This is to be expected since finding the random 
projection vectors involves a regression in a high dimensional space as opposed to solving a lower dimensional 
QCQP. Selecting 7 = 0.1 leads to perfectly correlated solutions on the training set for all n. This turns out 
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Algorithm 2 Random projections basis generation 

Input: matrices X'-^\ . . . X^"^\ 7 - the regularization coefficient, k - the number of projections per 
block 

for z = 1 to m do 

■— random rii x k matrix where each element is sampled i.i.d. from standard normal distribution. 
Re-scale each column of P{i^i) so that its norm is equal to y'^. 
for j = 1 to m do 
ii j = i then 

continue 
end if 

P{ij) ■= a(i.j)X'^'')X(*)"^P(j j) , where Ij is the Uj x rij identity matrix, 
end for 
end for 

Output: matrices P(i.j) for i,j = l,...,m 

Table 4: Train and test sum of correlation 



(a) Train set sum of correlations 





7 =0.1 


7 =0.5 


7 =0.9 


7 =0.99 


K =0.01 


10.0 


9.8 


9.8 


9.8 


K =0.1 


10.0 


9.8 


9.8 


9.8 


K =0.5 


10.0 


9.8 


9.8 


9.8 


K =0.9 


10.0 


9.8 


9.8 


9.8 


K =0.99 


10.0 


9.8 


9.7 


9.8 



(b) Test set sum of correlations 





7 =0.1 


7 =0.5 


7 =0.9 


7 =0.99 


K =0.01 


5.8 


8.6 


9.6 


9.8 


K =0.1 


6.2 


8.6 


9.6 


9.8 


K=0.5 


7.0 


8.6 


9.6 


9.8 


K =0.9 


7.4 


8.8 


9.6 


9.8 


K =0.99 


7.4 


8.8 


9.6 


9.8 



to be over-fitted when we evaluate the sum of correlations on the test set. Note that higher k values in 
this case improve the performance on the test set but only up to a certain level below 7.5. As we increase 
7 to 0.5, we see a reduction in overfitting and 7 = 0.9 results in comparable performance on the test and 
train sets (the patterns are stable). We have demonstrated a technique to reduce the dimensionality of the 
original QCQP problem which still admits finding stable solutions. The reduced dimensionality enables us 
to investigate the behavior of the SDP relaxation. 

For the SDP bounds we observed behavior that was similar to the high-dimcnsional synthetic (generic) 
case. That is we found that the potential duality gap was very small and that the SDP and the Horst 
algorithm yielded the same result. For this reason we omit the SDP results from Table H] 

6 Discussion 

In the paper we studied a generalization of CCA to more than two sets of variables. We showed that the 
complexity of the problem is NP-hard and described a locally convergent method as well as presented how 



21 



to generalize the method to the nonhnear case with several canonical variates. Experimentally, we observe 
that the performance of the local method (with linear convergence) is generally good, although we identified 
problem settings where the local method can be far from global optimality. We presented a SDP relaxation 
of the problem, which can be used to obtain new local solutions and to provide certificates of optimality. The 
usefulness of the bounds was tested on synthetic problem instances and a problems related to cross-lingual 
text mining. The high dimensional nature of documents and the size of the document collections result in 
untractable memory requirements. We solved the issue by proposing a preprocessing step based on random 
projections. 

Future work includes analyzing the complexity of the other generalizations proposed in |18) . We found 
that noisy 1-dimensional embeddings present difficulties for the local approach as opposed to generic problem 
structures. A natural question is, are there other problem structures that result in suboptimal behavior of 
the local approach. We presented result based on textual data, however, this setting appears in many setting 
and we plan to extend the list of applications to include other modalities, such as images, sensor streams 
and graphs (social media analysis). 
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A Notation 

This section reviews the notation used throughout the paper. 

• Column vectors are denoted by lowercase letters, e.g. x. 

• Matrices are denoted by uppercase letters, e.g. X. 

• Constants will be denoted by letters of Greek alphabet, e.g. a. 

• Row vectors and transposed matrices are denoted by x'^ and X"^ respectively. 
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Subscripts are used to enumerate vectors or matrices, e.g. xi,X2, Xi, except in the special case of the 
identity matrix, /„ and the zero matrix Ok,i- In these cases, the subscripts denote row and cohimn 
dimensions. 

Parentheses next to vectors or matrices are used to denote specific elements: x{i) denotes the i-th 
element of vector x and X(i,j) denotes the element in the i-th row and j-th column of matrix X. 

Notation X{:,j) denotes the j-th column of X and X{i,:) denotes the i-th row (This is MATLAB 
notation and similar to the notation used in PJJJ). 

Parenthesis are also used to explicitly denote the components of a row vector: 

X- (x(l),x(2),...,a;(iV)), 

where x is an A^-dimensional vector. 

Let R" denote the n-dimensional real vector space and M"^™ denote the {n ■ m)-dimensional vector 
space used when specifying matrix dimensions and let N denote the natural numbers. 

Let §^ denote the space of symmetric positive definite n-hy-n matrices. 

Let double-struck capital letters denote vector spaces, e.g. X. 

Horizontally concatenating two matrices with the same number of rows, A and B, is denoted by [A B], 
e.g. stacking two column vectors xi and X2 vertically is denoted by [xf a;|"]"^. 

Superscripted indices in parenthesis denote sub-blocks of vectors or matrices corresponding to a vector 
encoding the block structure: b = (rii, . . . , n™) where g N for i 1, . . . , m. We use x^^^ to denote 
the i-th sub-column of the vector x, which by using the block structure b and one-based indexing 
corresponds to 

Matrix block notation A^*'-'^ represents i-th row block and j-th column block with respect to the block 
structure b. Using a single index in matrix block notation, A''^ , denotes a block row (column dimension 
of A^*) equals the column dimension of A). 

Let Square(-) denote componentwise squaring: if y = Square(x) then y{i) = x(i)^ and let diag(X) 
denote the vector corresponding to the diagonal of the matrix X. 
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